Design of #1346 #2655

Open
krizhanovsky wants to merge 6 commits into master from em-1346-design

Conversation

@krizhanovsky

Contributor

This PR is a copy of the #1346 design proposal, so that we can all discuss it in comments and work out a final design.

@krizhanovsky krizhanovsky left a comment

Contributor Author

I have a lot of comments and questions

Comment thread 1346-design.md
u64 pending_cpu;
} TfwCpuEma;
```
Save the time at the beginning of a SoftIRQ shot and check CPU usage at the end of the shot (to avoid the performance regression we would get by doing it on each request).

Contributor Author

Is this begin_time?

In one softirq shot we process many requests - how can we apply begin_time to all of them?

I think begin_time should be the time of receiving an skb. We can save the time somewhere (e.g. in a static per-cpu variable) - when we get an skb we do not know the client. But we need to call tfw_client_update_cpu_ema() not only on forwarding an HTTP message, but also on error responses. At all these calls we should know the socket and TfwClient.

Contributor

I think we can do it the same way as we do for client_mem: save begin_time at the beginning of ss_tcp_process_data and check it in the connection_recv_finish callback. For client_mem we do this to prevent performance degradation, and I think we can do the same for CPU.

Contributor

The socket is known in ss_tcp_process_data, so we can get the client from sk_user_data (connection)->client, the same way as we do for client_mem.

Contributor

Responses are a different case: we can do it in the process_resp function, the same way as we do for client_mem.
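A minimal sketch of the per-pass accounting discussed in this thread, assuming one timestamp is taken when processing of a socket starts and the whole elapsed time is charged to that socket's client when the pass finishes. The types `conn_t` and `client_cpu_t` and the functions `pass_begin()` and `pass_finish()` are hypothetical stand-ins, not Tempesta FW code; storing `begin_time` in the connection is also an assumption (the thread mentions a static per-CPU variable as another option). Timestamps are passed in explicitly so the logic can be tested without a real clock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-client accounting state. */
typedef struct {
	uint64_t pending_cpu;	/* time charged but not yet folded into the EMA, ns */
} client_cpu_t;

/* Hypothetical stand-in for a connection that knows its client. */
typedef struct {
	client_cpu_t *client;
	uint64_t begin_time;	/* timestamp taken when the pass started, ns */
} conn_t;

/* Called once when processing of the socket data starts. */
static void
pass_begin(conn_t *conn, uint64_t now)
{
	conn->begin_time = now;
}

/*
 * Called once when the pass is finished: the whole pass is charged to
 * the client, however many requests were parsed during it.
 */
static void
pass_finish(conn_t *conn, uint64_t now)
{
	assert(now >= conn->begin_time);
	conn->client->pending_cpu += now - conn->begin_time;
}
```

The clock is read twice per pass rather than twice per request, which is the trade-off described above: less precision in exchange for less overhead.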

Comment thread 1346-design.md
* The structure is used to accumulate execution time deltas and maintain
* a smoothed estimate (EMA) of CPU consumption.
*
* @last_ts - timestamp of the last update (in ns). Used to compute

Contributor Author

Isn't it expensive to get time with ns accuracy?

Contributor

I will try to change it to jiffies and check whether that is OK.

Comment thread 1346-design.md
In addition to `TfwTrainingStat`, implement a structure and a per-CPU array of these structures.
```C
/**
* Exponential moving average (EMA) tracker for per-CPU time usage.

Contributor Author

Good idea, I think EMA should work well here

Comment thread 1346-design.md
Pass `delta = new_ema - prev_ema` to `tfw_client_training_adjust_cpu_num`, which does the same as `tfw_client_training_adjust_req_num`.

**Defence mode**
In defence mode, use `delta_ema` on each SoftIRQ shot to calculate `z = (delta_ema - mean) / std`; if `z > threshold`, reject the connection with TCP RST and block the client by IP if necessary.

Contributor Author

Say we process requests for 1K clients in one SoftIRQ shot. Then all of them will use the same begin_time but a different now timestamp: the last client processed gets the largest CPU time and the first one the lowest. This is a computation bug.

Contributor

All requests belong to the same client (we process only one socket during ss_tcp_process_data). Yes, it is not accurate, but we do the same for client_mem to prevent performance degradation.

Comment thread 1346-design.md
* - computes elapsed time (@dt);
* - converts accumulated CPU time into normalized usage value;
* - applies time-based decay (older history loses weight);
* - updates EMA using a combination of decay and smoothing factor.

Contributor Author

The 2 bullets above are just the idea of EMA, right?

Contributor

yes

Comment thread 1346-design.md Outdated
unsigned int epoch;
} TfwTrainingStat;
```
We use the newly implemented function `tfw_client_training_adjust_req_num` in both training and defence modes.

Contributor Author

And what does the function do?

Comment thread 1346-design.md

**Training mode**
`conn_curr` is incremented/decremented.
Track the maximum number of concurrent connections (`conn_max`). When the maximum increases, compute `delta1 = new_max - old_max` and `delta2 = new_max² - old_max²` and use these values to update `sum` and `sumsq`.

Contributor Author

What are sum and sumsq? If they are the mean and the standard deviation, then the
computation is wrong for Welford:

n += 1
delta  = new_max - mean
mean  += delta / n
delta2 = new_max - mean
M2    += delta * delta2
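For reference, the Welford steps listed above can be sketched in C as follows. This is a user-space illustration only: it uses `double` for readability, whereas kernel code would need fixed-point arithmetic, and the names `welford_t`, `welford_add()` and `welford_variance()` are hypothetical. The fields `n`, `mean` and `m2` correspond to `n`, `mean` and `M2` in the pseudocode.

```c
#include <assert.h>

/* Running state of Welford's online algorithm. */
typedef struct {
	unsigned long n;	/* number of samples seen so far */
	double mean;		/* running mean */
	double m2;		/* sum of squared deviations from the mean */
} welford_t;

/* Fold one new sample into the running mean and M2. */
static void
welford_add(welford_t *w, double x)
{
	double delta = x - w->mean;	/* deviation from the old mean */

	w->n += 1;
	w->mean += delta / w->n;
	/* The second factor is the deviation from the updated mean. */
	w->m2 += delta * (x - w->mean);
}

/* Population variance; the standard deviation is its square root. */
static double
welford_variance(const welford_t *w)
{
	return w->n ? w->m2 / w->n : 0.0;
}
```

Feeding the samples 2, 4, 4, 4, 5, 5, 7, 9 gives a mean of 5 and a population variance of 4.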

Contributor

delta = curr - old_max;
s->sum += delta (per cpu)
n (not always +1), because we increment it only when the client is new!
total_sum = percpu_counter_sum(&s->sum);
s->mean = (total_sum << SCALE_SHIFT) / num_clients;
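A hedged sketch of the delta scheme described in this reply, assuming `sum` and `sumsq` hold the sum and the sum of squares of the per-client maxima, and that the client count grows only when a client is seen for the first time. The types `train_stat_t` and `client_t` and the function `client_update()` are hypothetical; plain integers stand in for the per-CPU counters.

```c
#include <assert.h>
#include <stdint.h>

/* Global training statistics over per-client maxima. */
typedef struct {
	uint64_t sum;	/* sum of per-client maxima */
	uint64_t sumsq;	/* sum of squared per-client maxima */
	uint64_t n;	/* number of clients seen */
} train_stat_t;

/* Per-client state. */
typedef struct {
	uint32_t max;	/* largest value observed for this client */
	int seen;	/* non-zero once the client has been counted in n */
} client_t;

/*
 * Account a new current value for a client. When the client's maximum
 * grows, the old maximum is replaced in both sums by adding only the
 * differences, so no per-client history has to be kept.
 */
static void
client_update(train_stat_t *s, client_t *c, uint32_t curr)
{
	if (!c->seen) {
		c->seen = 1;
		s->n++;
	}
	if (curr <= c->max)
		return;

	s->sum += curr - c->max;
	s->sumsq += (uint64_t)curr * curr - (uint64_t)c->max * c->max;
	c->max = curr;
}
```

For example, one client reaching maxima 3 and then 5, plus a second client reaching 2, leaves `sum = 7`, `sumsq = 29` and `n = 2`, which is exactly the sum and the sum of squares of the final maxima 5 and 2.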

Comment thread 1346-design.md Outdated
Track `curr`, the current number of in-flight non-idempotent requests. Increment `curr` in `tfw_http_req_enlist` and decrement it in `tfw_http_req_nip_delist`. Also track `max`, the maximum number of in-flight non-idempotent requests per client. When `max` increases, update the global training stats the same way as for connections (`delta1 = new_max - old_max` and `delta2 = new_max² - old_max²`).

**Defence mode**
Change the return type of `tfw_http_req_enlist` from `void` to `bool`. Call `tfw_client_training_adjust_req_num` on each new non-idempotent request, calculate the z-score, and return false if `z > threshold`. `tfw_http_req_enlist` is called from `tfw_http_req_fwd` and `tfw_http_req_fwd_resched`; these functions now return T_BLOCK if `tfw_http_req_enlist` fails.

Contributor Author

No, we compute the z-score only in training mode. In defence (protection) mode we only
compare the computed value with the current number of non-idempotent requests in flight.

Contributor

And this is an important note. It implies that in training mode we can compute only local values and merge them when we have finished processing the current client. We can use per-CPU counters. In defence mode, however, we can collect and sum all per-CPU counters at the beginning of processing the client and cache the result, so that it can be compared with the z-score even on each request.

Contributor

As I understand it, we calculate mean and std at the end of training mode (when switching to defence mode).
Then for each new value we calculate *z_score = ((s64)(val << SCALE_SHIFT) - s->mean) / s->std;
and compare it with the configured threshold.
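A minimal user-space sketch of this flow, under several assumptions that are not in the proposal: `SCALE_SHIFT` is 10, `std` is stored with the same scaling as `mean`, and the variance is derived from `sum` and `sumsq` as E[x²] - mean². The names `model_t`, `isqrt64()`, `model_finalize()` and `model_z_score()` are hypothetical. The shifts can overflow for very large inputs, which a real implementation would have to guard against.

```c
#include <assert.h>
#include <stdint.h>

#define SCALE_SHIFT 10	/* assumed fixed-point scale, not from the proposal */

typedef struct {
	int64_t mean;	/* mean scaled by 2^SCALE_SHIFT */
	int64_t std;	/* standard deviation scaled by 2^SCALE_SHIFT */
} model_t;

/* Integer square root: the largest r such that r * r <= x. */
static uint64_t
isqrt64(uint64_t x)
{
	uint64_t r = 0, bit = 1ULL << 62;

	while (bit > x)
		bit >>= 2;
	while (bit) {
		if (x >= r + bit) {
			x -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/* Done once, when switching from training to defence mode. */
static void
model_finalize(model_t *m, uint64_t sum, uint64_t sumsq, uint64_t n)
{
	uint64_t mean, ex2, var;

	assert(n > 0);
	mean = (sum << SCALE_SHIFT) / n;
	ex2 = (sumsq << (2 * SCALE_SHIFT)) / n;
	var = ex2 > mean * mean ? ex2 - mean * mean : 0;

	m->mean = (int64_t)mean;
	m->std = (int64_t)isqrt64(var);
}

/* Done for each new value in defence mode; the result is truncated. */
static int64_t
model_z_score(const model_t *m, uint64_t val)
{
	if (!m->std)
		return 0;
	return ((int64_t)(val << SCALE_SHIFT) - m->mean) / m->std;
}
```

With `sum = 7`, `sumsq = 29` and `n = 2`, the model holds a mean of 3.5 and a standard deviation of 1.5 in fixed point, so a value of 8 yields a z-score of 3. Because the division truncates, a threshold comparison could also be written as `(val << SCALE_SHIFT) - mean > threshold * std` to avoid the division entirely.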

Comment thread 1346-design.md Outdated
If `tfw_http_req_fwd` or `tfw_http_req_fwd_resched` returns T_BLOCK, the caller sends a 403 error response, drops the client connection with TCP RST, and blocks the client by IP.

**Epoch handling**
Each request is tagged with `training_epoch` to prevent mixing old and new training data (we add a new field to the `request` structure and save the epoch in it). When a request is removed from the server connection queue, we don't update the `curr` field if the request belongs to a previous epoch. (When a request is added to the server connection queue, it always belongs to the new epoch if training is enabled!)

Contributor Author

Why do we need this? IIUC this is for the case when net.tempesta.training is
changed many times, i.e. there are many transitions between training and protection
modes (maybe with disabled as well). It seems this is a sophistication just to avoid
starting training from absolute zero and to use requests in flight instead. Probably, this is
not a big enough win to justify the sophistication, at least in the first implementation.

Comment thread 1346-design.md
**Current method and alternatives**
The same problems and alternatives as for connections.

**CPU Tracking**

Contributor Author

If we have a cheap and precise enough nanosecond clock, then the current proposal should work. Meanwhile, I want to propose an alternative or additional change to block malicious users by CPU consumption.

Rework http_body_chunk_cnt and http_header_chunk_cnt limits as it's hard to unify the values for many-headers long messages and short-headers short-body messages.

Instead we need to detect artificially lowered chunk sizes for HTTP/1 and DATA and CONTINUATION frames for HTTP/2.

We can do this by learning the average DATA and CONTINUATION frame size in HTTP/2 and/or the data chunk size (skb-carried, not HTTP chunk) for both HTTP/1 and HTTP/2.

We should account the average (for training and protection modes) ONLY for multi-chunk messages. I.e. if a message has zero or 1 CONTINUATION or DATA frames, then we do not compute the average for it.

We learn and analyze the average chunk size, where a chunk is a CONTINUATION or DATA frame for HTTP/2 or an skb data chunk for HTTP/1 (not an HTTP chunk). It is essentially total_size / chunks_number, where total_size is the total body or headers size.

The average chunk size is about a kilobyte, maybe several kilobytes (with GRE), and we need to catch extremely small chunk sizes. A few occasional small chunks are probably OK; what we need to catch is a client sending a lot of small chunks. I.e. I propose to learn and detect the product N / average_chunk_size * chunks_number: this feature should deviate strongly between normal and attacking connections.

In comparison with the current http_body_chunk_cnt and http_header_chunk_cnt limits:

  1. this scheme handles small messages consisting of 1 small chunk normally
  2. it handles large messages consisting of many large chunks normally
  3. it triggers on large messages consisting of many small chunks
  4. these parameters are learnt from traffic, so we don't need to specify hard-to-define limits
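To make the proposed feature concrete, here is a hedged sketch of how N / average_chunk_size * chunks_number could be computed for one message. The comment does not define N, so `CHUNK_FEATURE_N` is an assumed scaling constant, and `chunk_feature()` is a hypothetical helper. Messages with fewer than two chunks are skipped, as proposed above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed scaling constant; the proposal leaves N unspecified. */
#define CHUNK_FEATURE_N (1ULL << 20)

/*
 * Return the feature value for a message of @total_size bytes that
 * arrived in @chunks chunks, or 0 if the message is not accounted.
 */
static uint64_t
chunk_feature(uint64_t total_size, uint64_t chunks)
{
	uint64_t avg;

	/* Only multi-chunk messages take part in training and detection. */
	if (chunks < 2 || !total_size)
		return 0;

	avg = total_size / chunks;
	if (!avg)
		avg = 1;	/* avoid division by zero for sub-byte averages */

	return CHUNK_FEATURE_N / avg * chunks;
}
```

Under these assumptions, a 64 KB body delivered in 16 chunks of 4 KB scores 4096, while the same body delivered in 4096 chunks of 16 bytes scores 268435456. The value grows with the square of the chunk count for a fixed message size, which is what separates the two cases so sharply.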

Contributor

Why nanoseconds and not CPU cycles?

Contributor Author

We discussed on today's call that this solution would only cover HTTP/2 framing attacks, not generic CPU attacks in the sense of #488 (e.g. think about ReDoS or parser-specific attacks).

Comment thread 1346-design.md
cpu_ema->ema = cpu_ema->ema *
((1 << SCALE_SHIFT) - decay) >> SCALE_SHIFT;
cpu_ema->ema += ((s64)usage - (s64)cpu_ema->ema) >> ema_alpha_shift;
}

Contributor Author

It is always easy to get the math wrong, so I strongly propose starting from a unit test that shows the algorithm's behavior on different data; see, for example, t/unit/user_space/percentiles.c.
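As an illustration of what such a test could exercise, here is a user-space sketch of the smoothing step only (`ema += (usage - ema) >> ema_alpha_shift`), leaving out the time-based decay because its inputs are not shown in the excerpt. The functions `ema_update()` and `ema_feed()` are hypothetical, and the sketch divides by a power of two instead of right-shifting, since right-shifting a negative signed value is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* One EMA step with smoothing factor alpha = 1 / 2^alpha_shift. */
static int64_t
ema_update(int64_t ema, int64_t usage, unsigned int alpha_shift)
{
	assert(alpha_shift < 63);
	return ema + (usage - ema) / ((int64_t)1 << alpha_shift);
}

/* Feed the same sample @steps times and return the resulting EMA. */
static int64_t
ema_feed(int64_t ema, int64_t usage, unsigned int alpha_shift,
	 unsigned int steps)
{
	while (steps--)
		ema = ema_update(ema, usage, alpha_shift);
	return ema;
}
```

With `alpha_shift = 3`, a single spike of 8000 moves an EMA of 0 to 1000, and a following zero sample brings it down to 875. A constant input of 1000 converges to within a few units of 1000 but never reaches it exactly, because the integer division truncates small differences. Rounding effects like this are the kind of behavior a unit test makes visible.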

@EvgeniiMekhanik force-pushed the em-1346-design branch 2 times, most recently from 6a95185 to ff306b4 on June 5, 2026 10:36
- Describe the algorithm used in training/defence mode for client
  connections.
- Add a benchmark to compare the algorithms and describe why the
  current algorithm was chosen.
- Add an accuracy comparison program to show that both algorithms
  demonstrate the same accuracy.
- Describe the algorithm used in training/defence mode for
  non-idempotent requests.
Comment thread 1346-design.md Outdated
* reinitialization of @max and @counter.
*/
typedef struct tfw_client_req_counter_t {
struct tfw_client_req_counter_t *next_free;

Contributor

I didn't get this. Will it be a union?

Comment thread 1346-design.md Outdated

delta1 = curr - old_max;
delta2 = (u64)curr * curr - (u64)old_max * old_max;
tfw_training_mode_adjust_req_num(delta1, delta2);

Contributor

This is unclear to me. Does tfw_training_mode_adjust_req_num() adjust the per-client sum and sumsq?

- Describe the algorithm used in training/defence mode for
  client memory usage tracking.
- Rework the algorithm used for tracking non-idempotent
  requests (now we use a common algorithm for both
  non-idempotent requests and memory usage tracking).
- Some fixes in the document.